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Abstract 

Given  scatterplot  data  {(Xj,  Fj),  (X2,  Y3),  ...(Xn,  Fn)},  Y  being  a  response  and  X  a  predictor,  a 
scatterplot  smoother  uses  local  averaging  to  estimate  the  dependence  of  Y  on  X.  A  simple  example 
is  the  running  lines  smoother,  which  fits  a  least  squares  line  to  the  the  Y  values  falling  in  a  window 
around  each  X  value.  A  smoother  generalizes  the  least  squares  line,  which  assumes  the  dependence  of 
Y  on  X  is  linear. 

In  this  paper,  we  extend  the  idea  of  local  averaging  to  likelihood  based  models.  One  such  applica¬ 
tion  is  to  the  class  of  generalized  linear  models  (Nelder  and  Wedderbum  (1972))  We  enlarge  this  class 
by  replacing  the  covariate  form  x/3  with  an  unspecified  smooth  function  «(*).  This  function  is  estimated 
from  the  data  by  a  technique  we  call  “ Local  Likelihood  Estimation a  type  of  local  averaging.  Multiple 
covariates  are  incorporated  through  a  forward  stepwise  algorithm. 

The  main  application  that  we  discuss,  however,  is  to  the  proportional  hazards  model  of  Cox 
(1972),  for  censored  data.  The  proportional  hazards  assumption  A(<  |x)  =  Ao(t)exp(z/3)  is  replaced 
by  A (t  |z)  =  A0(t)exp(s(a!)),  and  the  function  s(x)  is  estimated  from  the  data  by  local  likelihood 
estimation. 

In  a  number  of  real  data  examples,  the  local  likelihood  technique  proves  to  be  effective  in  uncov¬ 
ering  non-linear  dependencies. 

Finally,  we  give  some  asymptotic  results  for  local  likelihood  estimates  and  provide  some  methods 
for  inference. 
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2  Section  1:  Introduction 


LOCAL  LIKELIHOOD  ESTIMATION 

Robert  J.  Tibshirani 


1.  Introduction. 

Figure  (1)  contains  100  data  pairs  along  with  the  least  squares  line  summarizing  the 
relationship  of  a  response  (say  Y)  and  a  covariate  (X).  In  Figure  (2)  ,  the  least  squares  line 
has  been  replaced  by  a  “scatterplot  smooth.”  This  smooth  was  computed  by  a  type  of  local 
averaging —  around  each  X  value  a  window  of  20  points  was  formed  and  a  least  squares  line 
was  fit  to  the  points  in  the  window.  The  value  of  the  smooth  at  X  is  given  by  the  value  of  the 
“local  line”  at  X.  As  we  can  see,  the  smooth  captures  the  trend  of  the  data  better  than  the 
least  squares  line.  The  reason  is  simple —  the  smooth  doesn’t  make  a  rigid  assumption  about 
the  form  of  the  relationship  between  Y  and  X. 

In  recent  years,  there  has  been  a  great  deal  of  interest  in  scatterplot  smoothing  by  local 
averaging  (see  for  example  Cleveland(1979)  and  Friedman  and  Stuetzle(1981))  and  the  avail¬ 
ability  of  fast  computers  has  been  essential  in  this  development.  These  smooths  are  useful 
as  a  descriptive  tool  (as  we  have  seen  above)  and  also  as  building  blocks  for  non-parametric 
regression  models.  Important  developments  in  the  latter  area  can  be  found  in  Friedman  and 
Stuetzle  (1981)  and  Breiman  and  Friedman(1982). 

In  this  paper,  we  explore  an  application  of  smoothing  ideas  to  other  kinds  of  data.  In  par¬ 
ticular,  we  consider  (X,Y)  data  whose  relationship  is  expressible  through  a  likelihood  function. 
Take  for  example  the  situation  in  which  Y  is  a  0-1  response  and  X  is  a  covariate.  For  such  a 
data  set,  Figure  (3)  shows  the  logistic  regression  line,  estimated  by  maximum  likelihood.  On 
the  same  plot,  the  observed  logits  are  shown.  (Since  we  can’t  take  the  logit  of  0  or  1,  the  Y’s 
were  grouped  first).  In  Figure  (4)  ,  the  line  has  been  replaced  by  a  smooth.  As  was  the  case  in 
the  scatterplot  example,  the  smooth  does  abetter  job  of  capturing  the  relationship  between  Y 
and  X  than  the  line  does.  In  Figures  (5)  and  (6)  ,  we  see  another  example.  Figure  (5)  shows 
the  estimated  log  relative  risk  line  given  by  Cox’s  proportional  hazards  model,  applied  to  a  set 
of  survival  data.  In  Figure  (6)  ,  the  line  has  been  replaced  by  a  “log  relative  risk  smooth”. 

The  smooths  in  Figures  (4)  and  (6)  were  obtained  from  a  procedure  we  call  “local  likeli¬ 
hood”  estimation.  The  basic  idea  is  simple  extension  of  the  local  averaging  technique  used  in 
scatterplot  smoothing.  Given  a  global  method  for  estimating  a  linear  response  (e.g.  maximum 
likelihood  estimation  in  the  linear  logistic  model),  we  apply  it  locally,  estimating  a  separate 
line  in  a  window  around  each  X  value.  The  value  of  the  estimated  line  at  X  is  the  estimate  of 
the  smooth  response  function  at  X . 

By  varying  the  window  size,  we  can  control  the  smoothness  of  the  estimated  function.  The 
larger  the  windows,  the  smoother  the  estimated  function.  When  each  window  contains  100% 
of  the  data,  the  local  likelihood  procedure  corresponds  exactly  to  the  global  linear  method. 
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Hence  local  likelihood  completely  generalizes  linear  likelihood  estimation. 


This  paper  is  devoted  to  the  study  of  local  likelihood.  We  describe  the  method  in  general, 
showing  how  smooths  like  those  in  Figures  (4)  and  (6)  are  obtained,  and  we  will  study  some 
of  its  theoretical  properties.  In  the  exponential  family,  the  local  likelihood  method  extends 
the  class  of  generalized  linear  models  (Nelder  and  Wedderbum  (1972))  by  allowing  covariates 
to  enter  the  link  function  in  a  non-linear  fashion.  We  investigate  the  linear  logistic  model, 
a  member  of  this  class,  and  its  extension.  We  also  explore  in  depth  the  application  of  the 
method  to  the  proportional  hazards  model.  This  model  was  the  motivating  example  behind 
local  likelihood. 


The  sections  are  organized  as  follows.  Section  2  defines  the  local  likelihood  method  and 
describes  the  estimation  procedure  for  a  single  covariate.  We  also  discuss  a  forward  stepwise 
algorithm  for  building  multiple  covariate  models.  Section  3  contains  a  short  description  of 
the  application  of  local  likelihood  to  the  logistic  regression  model  for  binary  data.  Section 
4  describes  in  detail  the  application  of  the  local  likelihood  procedure  to  Cox’s  proportional 
hazards  model.  We  discuss  a  number  of  topics:  bootstrapping  the  models,  robustifying  the  fit, 
and  bias  of  the  procedure.  Finally,  some  real  data  examples  are  given. 


Section  5  provides  some  asymptotic  results  for  local  likelihood  estimates  in  the  exponential 
family.  Consistency  and  efficiency  of  the  estimates  are  discussed.  We  conjecture  (without 
proof)  similar  results  for  the  proportional  hazards  model. 


In  the  last  section  (6),  we  address  two  important  questions:  1)  how  many  parameters 
are  used  up  by  a  local  likelihood  smooth?  and  2)  is  it  reasonable  to  use  Akaike’s  Information 
Criterion  to  choose  the  window  size?  We  give  approximate  answers  to  these  questions,  backing 
up  our  claims  with  a  simulation  study. 


Figure  (1) 

Least  Squares  Line 
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Figure  (2) 
Scatterplot  Smooth 


Figure  (3) 

Logistic  Line 
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Figure  (4) 

Local  Likelihood  Logistic  Smooth 
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Figure  (5) 
Relative  Risk  Line 


Figure  (6) 

Local  Likelihood  Relative  Risk  Smooth 
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2.  Local  Likelihood —  A  description. 


2.1.  Introduction 

In  this  section  we  introduce  the  local  likelihood  idea.  Since  local  likelihood  estimation  is 
a  generalization  of  scatterplot  smoothing,  we  begin  with  a  review  of  the  latter. 


2.2.  A  Review  of  Scatterplot  Smoothing 


Given  independent  data  pairs  {(*1,  J/i),  ...(z„,  j/n)},  assumed  to  be  realizations  of  a  re¬ 
sponse  variable  Y  and  a  predictor  X,  a  scatterplot  smoother  produces  a  decomposition  of  the 
form 

Vi  =  *(*»•)  +  ft  (1) 

Here  «(•)  is  a  “smooth”  function  and  e,-  is  a  residual  error.  We  won’t  define  exactly  what 
“smooth”  means  here;  vaguely  speaking,  we’re  thinking  of  «(•)  as  a  function  less  smooth  than 
a  straight  line  but  smoother  than  an  interpolating  polynomial. 

There  are  many  ways  to  estimate  «(•) —  we’ll  concentrate  here  on  the  method  of  “local 
averaging”.  It  is  motivated  as  follows.  If  we  knew  the  joint  distribution  of  Y  and  X,  a 
reasonable  way  to  find  «(•)  would  be  to  minimize  E(Y  —  s(X))2,  where  the  expectation  is  taken 
over  this  joint  distribution.  Conditioning  on  X  =  x,  this  has  solution  s(x)  =  E(Y  \X  —  x)  for 
each  x.  In  practice,  we  don’t  know  this  joint  distribution  but  have  only  a  sample  from  it.  The 
idea,  then,  is  to  estimate  E(Y  \X  =  x)  from  the  data.  This  leads  to  the  class  of  local  average 
estimates  for  «(•): 

«(*.)  =  Ave/&Vis(xj)  (2) 

where  “Ave”  represents  some  averaging  operator  like  the  mean  and  Ni  is  a  “neighborhood”  of 
x,-  (a  set  of  indices  of  points  whose  x  values  are  “close”  to  a;,).  The  only  type  of  neighborhoods 
we’ll  consider  in  this  paper  are  symmetric  nearest  neighborhoods.  Assuming  that  the  data 
points  are  sorted  by  increasing  x  value,  these  are  defined  by: 


jfc  _  i  k  —  1 

Ni  =  {max(i - —  ,1),  ...,*  -  !,*,*'  +  1,  ...min{i  +  — —  ,n)} 


(3) 


The  parameter  k  is  called  the  span  of  the  smoother  and  controls  the  smoothness  of  the  resulting 
estimate.  The  value  of  k  must  be  chosen  in  some  way  from  the  data. 

If  Ave  stands  for  arithmetic  mean,  then  «(•)  is  the  running  mean,  the  simplest  possible 
scatterplot  smoother.  The  running  mean  is  not  a  satisfactory  smoother  because  it  creates 
large  biasses  at  the  endpoints  and  doesn’t  reproduce  straight  lines  (i.e.  if  the  data  lie  exactly 
along  a  straight  line,  the  smooth  of  the  data  will  not  be  a  straight  line).  A  slight  refinement  of 
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the  miming  average,  the  running  lines  smoother  alleviates  these  problems.  The  running  lines 
estimate  is  defined  by 

s(z«)  =  A>»  fiii%i  (4) 

where  /?oi  ^  Pu  are  the  least  squares  estimates  for  the  data  points  in  N{: 

Q  _  Hj&lfa-tdVi 

PU  IW*/-*)2  (5) 

fa  =  Vi  ~  Pli*i 

and  =  h  EyeJV,  */>  Vi  =  S  ^j€N,  Vi- 

The  running  lines  smooth  is  the  most  obvious  generalization  of  the  least  squares  line. 
When  every  neighborhood  contains  100%  of  the  data  points,  the  smooth  agrees  exactly  with 
the  least  squares  line.  For  smaller  spans,  it  produces  less  smooth  estimates.  Although  very 
simple  in  nature,  the  running  lines  smoother  produces  reasonable  results  and  has  the  advantage 
that  the  estimates  can  be  updated.  That  is,  to  find  s(zt*+i)  from  S(z,*),  only  a  0(1)  operation 
is  needed.  This  reduces  the  overall  algorithm  from  0(n2)  to  O(n). 


2*3*  Local  Gaussian  Smoothing 

Since  least  squares  estimation  corresponds  to  maximum  likelihood  when  the  data  are 
Gaussian,  it  is  not  surprising  that  the  running  lines  smoother  can  be  described  as  a  “running 
maximum  likelihood”  method  for  Gaussian  data.  Assume  as  before  that 

Vi  -  *(*•')  +  U  (6) 

and  in  addition  that  €,*  ~  .A/(0,<72).  Then  for  x  in  a  neighborhood  N{  of  z,*,  a  reasonable 
approximation  to  s(z)  is 

s{x)n0Oi  +  puz  (7) 

Considering  only  the  points  in  AT,-,  the  maximum  likelihood  estimates  of  /?0t  and  fiu  are  given 
by  (5)  .  Based  on  (7)  ,  this  gives  as  an  estimate  of  s(z»): 

s(zi)  —  fa  fiii^i  (8) 

Hence  running  lines  smoothing  corresponds  to  finding  approximate  maximum  likelihood  esti¬ 
mates  in  a  neighborhood  around  each  data  point. 

We  call  this  type  of  estimation  “LOCAL  LIKELIHOOD  ESTIMATION ”  or  “LOCAL 
LIKELIHOOD”  for  short.  In  this  paper,  we  extend  the  idea  of  local  likelihood  to  non-Gaussian 
likelihoods.  It  can  be  applied  in  principal  to  any  situation  in  which  the  effect  of  a  covariate  is 
modelled  through  a  likelihood.  In  fact,  as  we  will  see  in  the  proportional  hazards  model,  the 
“likelihood”  doesn’t  even  have  to  be  a  likelihood  in  the  strict  sense. 
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2.4.  Local  Likelihood:  General  Definition 

Suppose  we  have  n  data  tuples  of  the  form  ({/,-,  a:,-,  e,),  where  y  is  a  response  variable,  x 
is  a  covariate  or  predictor  variable,  and  t  is  a  vector  containing  any  additional  information. 
(In  censored  data  problems,  e  would  indicate  whether  y  is  censored;  in  many  problems  (like 
regression),  e  is  empty.)  Suppose  that  modelling  considerations  lead  to  maximization  of  a 
function  of  the  form 

HPo,Pl)  =  g{yi,V2,-VnJl,h,-0n,Cl,C2,-Cn)  (9) 

where  0,-  =  fto  +  Pixi-  For  example,  L(flo,Pi)  could  be  a  likelihood  function  and  the  estimates 
maximizing  L(Po,Pi)  would  be  the  maximum  likelihood  estimates.  The  LOCAL  LIKELI¬ 
HOOD  method  replaces  /?o  +  Pixi  with  an  arbitrary  smooth  function  «(*,): 

L{e{x  l),  *(*2),  •  •  •  «(*n))  =  g{y  i,V2, .  ■  • ,  y„,  h,  02, . . .  elf  e2, • .  •  «n)  (10) 

where  0,-  =  s(x,).  The  problem  is  to  estimate  «(•)  at  the  points  {xl,X2,—xn}.  Maximization 
of  L(8(xi),  s{x2),  ...»{xn))  results  in  an  unsatisfactory  estimate  due  to  overfitting.  In  many 
situations,  it  simply  reproduces  the  data.  As  an  alternative,  we  define  the  local  likelihood 
estimate  of  s(z,)  as 

«(zi)  =  Poi  +  PliXi  (11) 

where  ^0«  and  Pu  maximize  the  local  likelihood: 

LiiPoi,  Pit)  =  9{{.Vj i  Poi  "1"  Plixj )  }  i  j  ^  Hi)  (12) 

The  local  likelihood  procedure  produces  a  smooth  estimate  of  the  curve  «(•)  at  the  points 
{*1,2:2,  ...*„}•  It  avoids  overfitting  by  averaging  over  neighborhoods.  The  width  of  the  neigh¬ 
borhoods  (the  span)  controls  the  smoothness  of  the  resulting  estimate —  larger  spans  will  tend 
to  produce  smoother  curves. 

As  mentioned,  the  function  L(fi0,fi  1)  need  not  be  a  likelihood,  (in  Cox’s  model  it  is  a 
“partial  likelihood”),  but  in  any  case,  we  call  this  procedure  “Local  Likelihood”  estimation. 


2.5.  Local  Likelihood —  Definition  in  the  i.i.d.  Case 

In  the  i.i.d  case,  we  observe  n  independent  data  pairs  {(*1,  yi),  ...(*„,  y„)}  and  we  assume 
that  given  X  =  x,  Y  has  density 

Y  \x~f{Y,0)  (13) 

where  0  =  s(x).  The  likelihood  is  given  by: 

n 

L{8{xi),8(x2),...8(x„))  =  Y[f{Vj,0j) 


(14) 
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where  =  «(*/)• 

The  local  likelihood  estimate  of  «(z,)  is 

^i)  =  floi  d"  fluxi 

where  floi  and  flu  maximize  the  local  likelihood: 

Li  =  J|  f{Vj,floi  +  Piixj) 
jew 


(15) 


(16) 


2.6.  Asymptotic  Properties  of  Local  Likelihood  Estimates 

Other  than  the  fact  that  it  produces  smooth  estimates,  why  is  the  local  likelihood  proce¬ 
dure  reasonable?  On  a  heuristic  level,  it’s  easy  to  see  that  for  well  behaved  «(•)  functions,  if  fl0 
and  fli  are  consistent  estimators  in  the  global  case  (e.g.  m.l.e’s),  then  S(x,-)  will  be  consistent 
for  s(xi).  Consider  a  fixed  point  as,-.  As  n  — *•  oo  and  the  neighborhoods  shrink  in  such  a  way 
that  kn,  the  span  for  sample  size  n,  goes  to  infinity,  we  have  floi  — ►  floi,  flu  flu,  (floi  and  flu 
being  the  true  slope  and  intercept)  and  the  error  in  approximation  (7)  goes  to  zero.  Hence 
s(xi)  will  converge  to  a(z,).  When  floi  and  flu  are  local  maximum  likelihood  estimates,  the  lo¬ 
cal  likelihood  estimates  also  enjoy  the  optimality  properties  of  m.l.e’s.  They  are  asymptotically 
normal  and  first  order  efficient  with  respect  to  sample  size  kn.  These  properties  are  discussed 
in  Section  5. 


2.7.  The  Bias — Variance  Tradeoff 

The  span  parameter  controls  the  smoothness  of  the  estimated  function.  Larger  spans  will 
tend  to  produce  smoother,  less  variable  estimates,  but  these  estimates  will  tend  to  be  biassed 
if  the  underlying  function  is  non-linear.  Conversely,  smaller  spans  will  produce  less  biassed 
but  more  variable  estimates.  A  data-based  criterion  is  therefore  needed  to  select  the  span  that 
best  trades  off  these  two  factors  for  a  given  data  set.  We  describe  such  a  criterion  in  Section 
2.12. 


2.8.  Computation  of  Local  Likelihood  Estimates  in  the  i.i.d.  Case 


To  find  each  fli  =  {floi,  flu)  we  use  a  Newton- Raphson  search.  Let  Uflflfl)  be  the  2  by  1 
score  vector  with  jth  entry 

tW»)  =  (^) 

\  dPji  /  3=00 


(17) 
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and  /,(/? o)  be  the  2  by  2  observed  information  matrix  with  j'Jfeth  entry 


for  the  ith  local  likelihood  both  evaluated  at  some  point  0°.  Then  given  an  initial  guess  0init , 
the  Newton-Raphson  method  produces  the  new  trial  value: 

fir*  =  ptu + iCfir'r'uCpr')  (19) 

This  procedure  is  iterated  until  convergence.  It  is  used  to  find  0i  (and  hence  s(x,))  for  each 
neighborhood,  going  in  order  as  i  runs  from  1  to  n.  The  local  likelihood  estimate  is  used  as 
a  starting  value  for  the  maximation  of  L,+1;  because  the  estimates  don’t  tend  to  differ  much 
from  one  neighborhood  to  the  next,  convergence  is  typically  achieved  in  1  or  2  iterations. 

Since  an  0(k„)  operation  is  required  for  each  neighborhood,  the  the  entire  procedure  is 
0(n?)  (assuming  kn  ~  n).  This  is  not  a  problem  for  moderate  n  (say  n  ~  200)  because  of 
the  small  number  of  iterations  required.  For  larger  data  sets,  we  speed  up  the  procedure  by 
calculating  the  fit  only  every  mth  point;  this  reduces  the  running  time  by  about  a  factor  of  m. 
The  smooths  for  the  remaining  z-values  are  obtained  by  interpolation. 

The  scatterplot  smoother  of  Friedman  and  Stuetzle  (1982)  uses  updating  formula  to 
achieve  an  0(n)  algorithm.  We  have  been  unable  to  obtain  such  formulae  for  this  problem 
because  of  the  non-linear  nature  of  the  estimation. 


2.9.  Exponential  Family  Case 

A  special  case  of  the  above  occurs  when  /  is  a  member  of  the  exponential  family.  Then 
the  log  likelihood  has  the  form 

n 

log L  =  -  b(*})  -  e(yy,  °)}  (20) 

l 

where  9j  =  a(xj)  and  a  is  a  scale  parameter.  If  <r  is  unknown,  (20)  is  not  generally  an 
exponential  family  but  the  estimation  procedure  we  will  describe  is  unchanged  because  the 
local  score  function  for  0  doesn’t  involve  a. 

The  local  log  likelihood  is: 


log  Li  =  ~  b{9q)  ~  c(yy,  or)}  (21) 

jeNi 

where  9q  =  0oi  +  fiuxj-  Letting  X  represent  the  n  by  2  design  matrix  with  first  column 
(1,1,. ..1)*  and  second  column  (xi,x2,...x„y,  and  letting  W  =  diag{I(j  6  iV,)},  the  score 
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function  has  the  simple  form 


Ui{fii)  =  XtW(9-k,(Xfi))  (22) 

The  observed  information  is  7,(/J,)  =  X*Wb"  (XPi)X  and  the  Newton- Raphson  step  is: 

ft™  =  ftnit  +  If1{j}init)XtW{V  -  b'(Xpinit))  (23). 

In  the  above,  we  have  modelled  the  natural  parameter  9.  We  could  just  as  well  model 
some  other  parameter  (like  E(y))]  in  any  specific  problem,  there  may  be  reasons  to  prefer  one 
parametrization  to  another.  For  example,  in  the  binary  response  problem,  it  is  more  convenient 
to  model  the  natural  parameter  log  then  the  expectation  p  because  the  latter  would  require 
that  the  estimated  smooth  stay  between  0  and  1. 


2.10.  Relationship  to  Generalized  Linear  Models 

Model  (20)  can  be  viewed  as  a  extension  of  the  class  of  generalized  linear  models  (Nelder 
and  Wedderbum  (1972)).  A  generalized  linear  model  is  defined  by  Y  \x  /v  f(Y,9)zndE(Y)  = 
g{Po  +  Pix),  where  /  has  the  exponential  form  (20)  .  If  g  (the  “link  function”)  is  invertible, 
this  corresponds  to  g~1(E(Y))  =  f3a  +  ptx.  In  the  local  likelihood  set-up,  we  have  generalized 
Po  +  Pix  to  «(*). 


2.11.  Number  of  Parameters —  “Degrees  of  Freedom9 

In  Section  6,  we  discuss  an  approximate  method  for  determining  how  many  independent 
parameters  a  local  likelihood  smooth  is  really  fitting.  Since  the  local  likelihood  estimate  pro¬ 
duces  a  function  smoother  than  the  data,  we  would  expect  that  it  uses  less  than  n  independent 
parameters.  This  is  the  case.  Consider  a  running  lines  smoother  with  span  s.  Such  a  smoother 
is  linear  in  that  the  fit  y  can  be  written  as  P(s)jf  where  P(s)  is  a  smoother  matrix.  P(s)  will 
depend  on  the  set  of  x  values  observed,  as  well  as  the  span.  In  traditional  linear  least  squares 
estimation,  P(s)  is  the  hat  matrix  X(XtX)~lXt.  We  show  in  Section  6  that  for  a  running 
lines  smoother  with  span  a,  the  number  of  degrees  of  freedom  used  up  is  trace(P(s)).  (This 
result  and  related  results  are  also  given  in  Cleveland  (1979)).  We  also  show  that  for  any  local 
likelihood  fit  (in  the  exponential  family),  with  span  s,  the  number  of  degrees  of  freedom  is 
about  trace(P(s)).  Thus,  although  the  matrix  P(s)  is  only  used  in  the  estimation  process  of 
the  Gaussian  local  likelihood  model,  (and  not  in  the  estimation  of  other  local  likelihood  mod¬ 
els),  the  trace  of  this  matrix  turns  out  to  be  the  relevant  quantity  nonetheless.  Note  that  this 
generalizes  the  result  in  linear  estimation,  in  which  P(s)  is  an  idempotent  projection  matrix 
and  hence  trace(P(s))  =  rank(P(s))  =  p,  the  rank  of  the  column  space  of  X. 
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The  quantity  trace(P(a))  turns  out  to  be  significantly  less  than  n.  In  an  example  given 
in  Section  6,  with  100  data  points  and  a  =  .5,  traee{P(a))  is  3.65.  Thus  we  are  really  fitting 
only  3.65  “parameters*. 

We  also  note  in  Section  6,  however,  that  the  “number  of  parameters*  or  “degrees  of 
freedom*  should  be  used  only  as  a  rough  rule  of  thumb.  This  is  because  the  distribution 
involved  is  not  chi-squared  with  trace(P)  degrees  of  freedom,  but  is  more  spread  out.  If  a 
precise  estimate  of  the  distribution  is  required  (for  a  given  data  set),  a  simulation  technique 
described  in  Section  6  can  be  used. 


2.12.  Span  Selection 

The  estimation  of  a  local  likelihood  smooth  requires  the  choice  of  a  span  size.  In  scat- 
terplot  smoothing,  one  popular  method  for  choosing  the  span  size  is  cross-validation.  For  a 
number  a  trial  spans,  smooths  are  estimated  leaving  out  each  data  point  one  by  one.  A  cross- 
validation  sum  of  squares  is  calculated  and  the  span  having  the  smallest  value  is  selected.  This 
is  detailed  in  Friedman  and  Stuetzle  (1981). 

In  the  local  likelihood  problem,  cross-validation  turns  out  to  be  very  expensive  computa¬ 
tionally  .  As  an  alternative,  we  use  a  form  of  Akaike’s  Information  Criterion  ( AIC ).  In  fitting 
a  generalized  linear  model  with  maximized  likelihood  L  and  p  independent  parameters,  the 
AIC  is  defined  by: 

AIC  =  —  2  log  L  +  2p  (24) 

The  first  term  measures  the  goodness  of  fit  of  the  model,  while  the  second  term  penalizes  the 
number  of  parameters  used.  Hence  the  AIC  attempts  to  tradeoff  variability  and  bias. 

We  make  use  of  the  AIC  criterion  for  selecting  the  span  of  the  local  likelihood  smooth. 
Using  trace(P(a))  as  an  approximate  number  of  degrees  of  freedom,  the  span  size  a  is  selected 
to  minimize  AIC  based  on  the  value  of  the  global  likelihood  (10)  .  In  a  number  of  examples, 
we’ll  see  that  this  procedure  chooses  reasonable  span  sizes,  producing  estimates  that  aren’t  too 
jagged  nor  too  biassed.  We  justify  this  use  of  AIC  is  Section  6. 


2.13.  Effects  of  Sample  Size 

The  local  likelihood  technique,  being  non-parametric  in  nature,  will  work  best  in  larger 
samples  (say  n  >  100).  Sample  size  is  not  the  only  factor,  however —  the  amount  of  (non-linear) 
structure  in  the  data  relative  to  the  noise  component  is  also  important.  We’ve  found  that  the 
technique  can  pick  up  non-linear  structure  in  some  data  sets  with  as  few  as  60  observations. 
More  often,  however,  the  local  likelihood  algorithm  chooses  a  large  span  for  small  data  sets, 
and  the  resultant  estimate  closely  resembles  the  global  linear  estimate. 
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2.14.  Handling  of  Ties 

For  data  with  tied  x  values,  two  things  are  done.  First,  each  neighborhood  is  expanded 
(if  necessary)  to  ensure  that  if  a  point  j  is  in  a  given  neighborhood,  so  is  any  other  point  k 
having  x*  =  xy.  This  makes  the  estimation  procedure  invariant  to  the  incoming  order  of  the 
data  points.  Secondly,  the  smooths  for  each  of  the  tied  values  are  averaged  and  each  smooth 
value  is  assigned  the  average.  That  is,  if  xy  =  xy+i . . .  =  xy+m,  then  for  each  /  <  i  <  j  +  m, 
s(xt)  is  assigned  the  value  ]Cy+m  +  1)* 


2.15.  Multiple  Covariates  and  Backfitting 

The  above  discussion  shows  how  the  local  likelihood  idea  can  be  used  to  estimate  the 
smooth  for  a  single  covariate.  H  p  covariates  are  available,  there  are  two  ways  the  model  could 
be  generalized.  One  could  assume  9  =  «(•,•,...•••),  a  smooth  p  dimensional  function,  and 
then  estimate  s  by  local  averaging  in  p-dimensional  space.  This  procedure  would  suffer  from 
the  so-called  curse  of  dimensionality .  in  high  dimensions,  averaging  neighborhoods  grow  very 
large.  Consider  for  example  a  data  cloud  uniformly  distributed  in  a  ten  dimensional  unit  cube. 
Then  a  hybercube-  shaped  neighborhood  containing  10%  of  the  data  points  would  have  sides 
of  length  .8!  (since  .810  w  .1).  Hence  the  local  averaging  would  not  be  “local”  at  all. 

To  avoid  the  curse  of  dimensionality,  we  can  smooth  one  co-ordinate  at  a  time.  The 
model  takes  the  form  9  =  Sy=i  */(')•  This  model  is  less  general  than  the  p-dimensional 
smooth  model,  but  one  can  make  it  more  general  by  allowing  smooth  functions  of  products, 
e.g.  d(xt'Xy).  To  estimate  the  $y(-)’s,  a  forward  stepwise  algorithm  is  used.  This  algorithm  is 
analagous  to  a  forward  stepwise  regression  procedure.  The  algorithm  proceeds  by  smoothing 
on  each  variable,  and  selecting  the  smooth  that  most  improves  the  fit.  When  one  smooth 
is  selected,  the  remaining  variables  are  smoothed  and  the  one  that  most  improves  the  fit  is 
chosen.  The  process  is  repeated  until  no  new  variable  can  significantly  improve  the  fit. 

Now  suppose  that  this  procedure  selects  a  smooth  $i(*)  at  the  first  step  and  a  smooth 
$2(0  at  the  second  step.  Then  the  smooth  «i(*)  may  not  be  “optimal”  given  that  «2(*)  is  in  the 
model.  Hence  it  is  desirable  to  re-estimate  «i(*)  to  accomodate  S2(*)-  Now  given  the  adjusted 
estimate  $J(*),  we  can  adjust  $2(')  and  so  on,  iterating  until  convergence.  This  process  is 
called  “backfitting”  (Friedman  and  Stuetzle(1982)).  In  general,  with  more  than  two  smooths, 
whenever  a  new  smooth  is  entered  into  the  model,  the  smooths  already  in  the  model  are 
adjusted  to  accommodate  the  new  smooth.  Specifically,  all  but  one  of  the  smooths  are  held 
constant  and  the  remaining  smooth  is  re-estimated.  This  is  done  for  each  smooth  in  turn  until 
the  fit  no  longer  improves  by  a  significant  amount.  As  an  example,  suppose  a  new  smooth 
*r+i(  )  added  to  a  model  containing  smooths  «i(  ), . . .  «,.(■).  Then  the  backfitting  procedure 
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would  consist  of  estimating  s y(-)  in  the  model 

*  =  £«*(')  +  «/(•)  (25) 

treating  **(’)  as  a  constant.  This  is  done  for  j  running  from  1  to  r  +  1.  The  entire  cycle 
is  repeated  until  convergence. 

We  have  no  proof  of  convergence  for  the  backfitting  algorithm,  although  it  has  converged 
in  all  the  examples  that  we’ve  tried.  In  a  simple  linear  regression  framework,  with  p  (possibly 
non-orthogonal)  covariates  ®i,  Z2, . . .  zp,  one  can  show  that  backfitting  converges  to  the  correct 
answer  (Stuetzle  (1983),  personal  communication).  That  is,  if  we  project  the  current  residual 
vector  onto  each  covariate  in  turn,  the  residual  vector  converges  to  the  correct  residual  vector 
i.e.  the  response  minus  the  projection  of  the  response  onto  the  column  space  of  Z2, . . .  xp. 
Breiman  and  Friedman  (1982)  show  convergence  of  a  backfitting  algorithm  in  a  additive  smooth 
model,  for  a  restricted  class  of  smoothers. 


2.16.  How  do  we  select  terms  for  the  model? 

This  question  can  be  addressed  through  examination  of  —2  log  L(p),  but  it  is  customary 
in  generalized  linear  modelling  to  work  with  an  equivalent  measure,  the  “deviance”.  The 
deviance  is  2log(L(jf)/L(y))  which  equals  to  -2logL(£)  +  constant.  At  each  stage,  then,  we 
find  the  smooth  that  decreases  the  deviance  the  most.  This  smooth  is  then  added  to  the  model 
if  the  decrease  in  the  deviance  is  large  compared  to  the  number  of  “parameters”  used  up  by 
the  smooth. 

An  outline  of  the  resulting  algorithm  is: 


Forward  Stepwise  Algorithm 

While  (not  all  variables  have  been  selected) 

Find  the  smooth  that  decreases  deviance  the  most 

If  decrease  <  threshold 1  exit 

If  current  model  contains  more  than  one  smooth 

Backfit  smooths  until  decrease  in  deviance  <  threshold2 


End  While 
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Threshold\  can  be  set  to  a  value  determined  by  the  number  of  degrees  of  freedom,  or  set 
to  zero  to  allow  all  covariates  to  enter.  Thresholds  is  set  to  some  small  number  like  .001.  The 
output  of  the  algorithm  is  {«i(-)>*2()>  •••«*(•)}  where  h  is  the  number  of  smooths  selected. 


2.17.  The  Scale  Parameter  in  the  Exponential  Family  case 

The  exponential  form  (20)  may  or  may  not  contain  an  unknown  scale  parameter,  but  in 
any  case,  the  likelihood  estimation  procedure  is  unchanged  because  the  score  doesn’t  involve 
the  scale.  An  estimate  of  scale  is  needed,  however,  if  the  deviance  is  to  be  used  to  assess 
importance  of  model  terms.  As  is  true  for  standard  generalized  linear  models,  we  would  fit 
some  maximal  model  and  use  the  mean  deviance  as  our  estimate  of  scale.  This  could  be  used 
to  form  a  “scaled  deviance”,  proceeding  thereafter  as  if  the  scale  were  known. 

In  the  only  exponential  family  model  we  discuss  (the  logistic  model)  the  scale  is  a  function 
of  the  mean,  so  this  issue  doesn’t  arise.  Hence  we  will  not  go  into  scale  estimation  in  this  paper. 


2.18.  Other  Fitting  Procedures 

The  local  likelihood  procedure  uses  local  linear  estimation  because  it  works  well  (especially 
in  reducing  bias  at  the  endpoints)  and  is  simple.  More  sophisticated  kernel-type  estimates  could 
be  used  to  make  the  procedure  robust  and  increase  the  smoothness  of  the  estimated  function. 
Borrowing  ideas  from  scatterplot  smoothing  (see  e.g.  Cleveland  (1978)),  we  can  downweight 
points  based  both  on  their  distance  from  the  center  of  the  neighborhood  and  the  size  of  the 
residual.  While  we  don’t  discuss  such  a  procedure  in  general  terms,  a  downweighting  algorithm 
for  the  proportional  hazards  in  discussed  in  Section  4.11. 
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3.  Application  to  the  Logistic  Model. 


3.1.  Introduction 

In  Section  2,  we  discussed  how  the  local  likelihood  technique  could  be  applied  to  any- 
generalized  linear  model.  Probably  the  most  commonly  used  such  model  (besides  the  normal 
regression  model)  is  the  linear  logistic  model  for  binary  data.  In  this  section,  we’ll  briefly 
illustrate  the  local  likelihood  procedure  in  this  setting;  full  details  can  be  found  in  Hastie 
(1983) 


3.2.  The  Problem  and  a  Review  of  the  Linear  Logistic  Model 

We  have  data  of  the  form  {(yi,  Xi),(y2,  x2),  ..(yn,  *n)}  where  the  response  y  is  0  or  1  and 
x  is  an  explanatory  variable.  The  observations  are  assumed  to  be  independent.  The  problem 
is  to  investigate  the  dependence  of  y  on  x. 

Let  x  =  (1,  x)  and  let  p(x)  =  P(y  =  1  |x).  The  log  likelihood  of  the  data  is 

n 

log  L  =  ]T{y,logpy  +  (1  -  yy)  log(l  -  py)}  (26) 

;'=1 

where  py  =  p(asy).  Letting  X  represent  the  matrix  with  jth  row  equal  to  (l,zy),  the  score 
equation  has  the  form 

Xt(f-p)  =  0  (27) 

The  linear  logistic  model  assumes  that 

logit  p{x)  =  x*P  (28) 

Written  as  a  function  of  P,  the  log  likelihood  is 

n 

log  L(P)  =  £{y.-«|*  -  log(l  +  e*!*)}  (29) 

i= i 

A  Newton-Raphson  procedure  is  typically  used  to  find  p.  The  expected  information  matrix  is 

I(P)=X,Diag{pj(l-pj)}X  (30) 

and  the  Newton-Raphson  iteration  has  the  form 


Pnew  —  Pold  +  I  ^ {Po(d)X* —  pold) 


(31) 
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Section  S:  Application  to  the  Logistic  Model 


3.3.  The  Local  Likelihood  Generalization 

The  formulation  of  subsection  2.3  for  generalized  linear  models  can  be  applied  directly. 
Instead  of  assuming  a  linear  form  for  logit  p(x),  we  assume 

logit  p(z)  =  s(z)  (32) 

The  local  likelihood  for  x%  is 

log  Ltifli)  =  £  {yyaj-ft  -  log(l  +  (33) 

jew 

Letting  /?,*  maximize  log£,-(/9j),  the  local  likelihood  estimate  of  *(af-)  is  s(z#)  =  With 

multiple  covariates,  the  model  takes  the  form 

p 

logit  p(*)  =  J^«(-)  (34) 

1 

A  forward  stepwise  algorithm  is  used  to  select  covariates,  as  described  in  Section  2. 


3.4.  Example  1:  Breast  Cancer  Data 

A  study  conducted  between  1958  and  1970  at  the  University  of  Chicago’s  Billings  Hospital 
concerned  the  survival  of  patients  who  had  undergone  surgery  for  breast  cancer  (Haberman 
(1976)).  There  are  306  observations  on  4  variables. 

y= 1  if  patient  survived  >  5  years,  0  otherwise 
zi=age  of  patient  at  time  of  operation 
x2=year  of  operation 

33=number  of  positive  axillary  nodes  detected 

The  local  likelihood  procedure  applied  to  all  three  covariates  produced  the  smooths  shown 
in  Figures  (7)  ,  (8)  ,  and  (9)  .  Table  1  shows  the  decrease  in  deviance  due  to  each  variable. 
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Table  1.  Analysis  of  Breast  Cancer  Data 


Model 

Deviance 

Number  of  Parameters 

Constant 

353.67 

1 

Age(span=  .6) 

346.08 

2.41 

Age  +  Yr  of  Oper(span  .5) 

343.91 

2.41  +  2.54 

Age+Yr  of  oper+  #  of  nodes(span  .5) 

307.74 

2.41  +  2.54  +  2.41 

Age  and  number  of  nodes  are  important,  year  of  operation  is  not.  The  final  model  has  a 
deviance  of  307.74  on  (306-2. 41-2. 54*2. 41)=298. 54  degrees  of  freedom. 

Landwehr  et  al  (1984)  analyzed  this  data  set  to  explore  the  usefulness  of  partial  residual 
plots  in  identifying  parametric  forms  of  covariate  effects.  Their  final  model  was 

logit  p(x)  =  fa  +  +  x\f$2  +  x\Ps  +  *2/^4  +  *1*2)05  +  (foy(l  +  xs))Ps  (35) 

The  deviance  of  this  model  is  302.3  on  299  degrees  of  freedom.  The  fitted  terms  for  each 
covariate  are  super-imposed  on  Figures  (7)  ,  (8)  ,  and  (9)  (broken  lines).  The  functions  for  x\ 
and  *3  are  similar;  they  differ  for  x2,  but  the  overall  effect  of  this  variable  is  very  small. 

Hastie  (1983),  Hastie  (1984)  and  Hastie,  Tibshirani  and  Owen  (1984)  discuss  the  relative 
merits  of  the  local  likelihood  and  partial  residual  plot  procedures.  They  give  two  reasons  to 
suggest  why  the  local  likelihood  procedure  is  preferable: 

•  The  partial  residual  technique,  in  suggesting  the  parametric  form  for  a  covariate  effect, 
relies  on  the  assumption  that  the  covariate  forms  for  other  effects  are  correct.  Indeed 
these  effects  are  usually  assumed  to  be  linear.  The  local  likelihood  procedure  finds  the 
best  functional  form  for  all  covariates  simultaneously. 

•  The  partial  residual  technique  requires  quite  a  bit  of  ingenuity  in  identifying  the  various 
covariate  effects.  The  local  likelihood  procedure,  on  the  other  hand,  is  automatic. 


3.5.  Comparison  to  the  Scatterplot  Smoothing  Approach 

The  local  likelihood  method  extends  the  linear  logistic  model  through  a  type  of  local 
averaging  within  the  likelihood  framework.  Computationally,  it  would  seem  simpler  to  ignore 
the  fact  that  the  y’s  are  0’s  and  l’s  and  apply  scatterplot  smoothing  techniques  directly.  This 
works  fine  for  a  single  covariate:  a  scatterplot  smooth  of  y  on  x\  is  shown  in  Figure  (10)  .  On 
the  same  figure,  the  estimated  local  likelihood  probability  smooth  exp(  *(*i)/(l  +  exp(s(n)) 
is  shown  (broken  line).  Not  surprisingly,  the  two  smooths  are  very  similar. 

With  multiple  covariates,  the  local  likelihood  approach  is  more  attractive  for  precisely 
the  same  reasons  that  the  linear  logistic  model  has  gained  popularity.  In  fitting  a  model  of 
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the  form  y  —  Yli  8(xi),  one  would  have  to  ensure  that  the  fitted  probabilities  lie  between  0 
and  1.  This  would  require  some  sort  of  truncation  of  the  smooths.  On  the  other  hand,  the 
local  likelihood  approach  models  logit  p  so  the  fitted  probabilities  are  always  between  0  and  1. 
Secondly,  the  local  likelihood  approach  produces  an  additive  model  on  the  logit  scale.  A  large 
body  of  literature  suggests  that  for  many  types  of  data,  effects  are  more  likely  to  be  additive 
on  the  logit  scale  than  on  the  probability  scale.  One  could  try  to  adapt  the  regression  approach 
by  grouping  the  y's  then  using  the  logit  of  the  grouped  values  as  responses.  This  would  likely 
produce  similar  results  to  the  local  likelihood  approach  if  the  information  loss  due  to  grouping 
wasn’t  too  large. 


A  note  on  the  figures 

Many  of  the  figures  in  this  section  and  the  next  section  use  circles  to  represent  fitted 
smooth  values.  The  area  of  each  circle  is  proportional  to  the  number  of  data  points  it  represents. 
This  is  designed  to  indicate  the  density  of  points  in  each  part  of  the  smooth. 
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Figure  (8) 

Estimates  for  Year  of  operation 
Circlet:  L.L  smooth,  Broken  line:  parametric  function 
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Figure  (10) 

Estimates  for  Age 

Circles:  L.L  smooth ,  Broken  line:  Scatterplot  smooth 
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4.  Application  to  Censored  Data  and  the  Cox  Model. 


4.1.  Introduction 


In  the  censored  data  problem  we  observe  data  triples  (yy,*y, £,•),*  =  l,2,...n  where  £,•  is 
indicates  whether  or  not  the  response  yy  is  censored.  The  data  are  assumed  to  be  sorted  by 
the  covariate  x,  that  is  xx  <  x-i  <  . . .  <  x„.  The  proportional  hazards  model  of  Cox(1972) 
models  the  relationship  between  y  and  x  by  assuming  that  x  acts  on  the  hazard  function  in  a 
multiplicative  way: 

A(y  |  x)  =  A 0(y)  expire)  (36) 


where  Ao(y)  is  an  unspecified  function,  and  X(t  | m)  is  the  hazard  function  at  covariate  level  x 
defined  by 


X(t  Is)  =  lim 

1  '  At—*0 


P{t<T<t  +  At\T>  t,z) 
At 


(37) 


This  assumption  allows  P  to  be  estimated  independently  of  Ao(y)  by  maximizing  the  partial 
likelihood : 


pl- n 


Eyes.. 


(38) 


i£D 

where  D  is  the  set  of  indices  of  the  uncensored  y’s  and  /2y  is  the  risk  set  prior  to  yy.  The  local 
likelihood  generalization  of  (36)  is 


A (y  | a:)  =  A0(y)exp(«(z)) 


(39) 


and  the  partial  likelihood  of  the  data  is 


i€D 


exp  (s(g,)) 
Eye*,.  exp  («(zy)) 


(40) 


where  J2,-  =  {j  |yy  >  y,},  the  risk  set  at  time  y,-  —  0. 

To  estimate  d(zi),d(x2)> .  • .  a(xn),  we  apply  the  local  likelihood  technique  introduced  in 
Section  2.  As  before,  let  JV,-  be  a  symmetric  neighborhood  around  x,*: 

N%  =  {max(i  —  ^y^),l), ...»  -  l,t,»  +  1,.  ..mtn(t  +  (41) 

For  x  €  JVy  we  assume  «(z)  w  a,-  +  xfc,  and  the  local  partial  likelihood  for  the  data  in  JVy  is 


pl<=  n 

leDnNi 


exp  (ay  +  xi 

Eye^rw,  exP  +  xjPi) 


(42) 
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To  estimate  a,*  and  Pi,  we  maximize  PL,*.  Note,  however,  that  a,*  is  not  estimable  from 
PLi  since  the  ezp(a,*)  terms  cancel  one  another  giving 


pu=  n 

leDnNi 


exp  {xiPi) 

exP  ixiPi) 


(43) 


Let  Pi  maximize  £,(•).  Although  a,-  (thus  «(*,))  is  not  estimable  locally,  we  can  use  the  slope 
estimates  {Pi, . . .  fa}  to  estimate  {a(zi), . . .  «(a:n)},  as  follows.  We  have  e(x,-)  =  /*'  8'(t)dt  and 
8'(x)  =  Pi  for  x  e  Ni,  hence  to  estimate  «(*,•)  we  can  use  any  estimate  of  /*'  a\t)dt  based  on 
(*i>/?i)>  •  •  •  {xm Pn)-  Before  discussing  some  particular  integral  estimators,  it  is  important  to 
note  that  the  choice  of  c  is  arbritrary,  reflecting  the  fact  that  s(x)  is  only  determined  up  to  an 
additive  constant.  Substitution  of  a(x)  +  c  for  s(x)  in  (36)  doesn’t  change  the  model  because 
the  factor  ee  can  be  absorbed  into  the  arbritrary  function  Ao(t).  For  simplicity,  then,  we  define 
c  =  Xi,  so  that  i(xi)  =  0. 

To  estimate  /*'  a'(t)dt,  we  can  use  the  simple  rectangular  rule  defined  by 

I 

S(z,*)  =  }  ~  £/-i)  *  P%  (44) 

i 


for  t  >  1  and  5(xi)  =  0.  This  could  also  be  written  as  S(z,*)  =  (x,-  —  z,-_x)  *  /?,*,  so  that  the 
rectangular  rule  constructs  the  estimate  S(*)  by  joining  each  line  segment  to  the  previous  one, 
with  prescribed  slope  Pi 

For  greater  accuracy,  we  instead  use  the  trapezoidal  rule  defined  by 

*(*•)  =  *  —  +/*~^  (45) 
i  L 

for  t  >  1  and  a(xi)  =  0. 

The  procedure  is  summarized  in  the  following  algorithm: 
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Local  Likelihood  Smoother  for  the  Cox  Model 

For  i=l  to  n 

Find  fix  that  maximizes  PL,() 

End  For 

«(zi)  =  0 

For  i=2  to  n 

«(*•)  =  S',(*y  - 

End  For 

Output  {«(zi),  S[x2), . . .  «(&„)} 


4.2.  Significance  of  a  Smooth  and  “Degrees  of  Freedom” 

In  proportional  hazard  modelling,  the  “deviance”  has  no  obvious  analogue,  so  one  works 
directly  with  —2  log  PL  to  assess  significance  of  a  smooth.  The  “degrees  of  freedom”  of  the 
smooth  are  more  difficult  to  obtain,  however.  The  simulation  study  in  section  6  shows  that 
the  formula  trace(P)  is  biassed  downward  for  the  Cox  model.  Therefore,  we  find  the  mean 
deviance  decrease  by  the  simulation  technique  described  in  that  section.  The  trace  formula  is 
still  adequate  for  span  selection,  however,  since  biasses  will  tend  to  cancel  out  in  comparing 
two  spans. 


4.3.  Handling  Tied  Survival  Times 


For  data  with  tied  t,-  values,  we  use  the  approximation  suggested  by  Peto(1972)  and 
Breslow(1974)  for  the  partial  likelihood 


PL  »  H 

I 


exp  ( Zj  •  0) 

(EyeH(t|)  exp  (zy  •  fi))d> 


(46) 


where  d,-  equals  the  number  of  failures  at  t;  and  z,-  equals  the  sum  of  z,’s  for  items  failing  at 
t,-.  This  approximation  is  used  for  each  of  the  partial  likelihoods  PL,-(-). 
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4.4*  Multiple  Covariates 

With  more  than  one  covariate,  the  model  takes  the  form 

p 

\{t  | *)  =  A0(«)  exp (£ *;(•))  (47) 

y=i 

The  smooths  are  estimated  in  a  forward  stepwise  manner,  with  backfitting,  as  discussed  in 
Section  2. 


4.5.  Example  2:  The  Stanford  Heart  Transplant  Data 

The  first  example  that  we  will  use  for  illustration  of  this  technique  is  the  Stanford  Heart 
Transplant  Data,  as  reported  by  Miller  and  Halpem(1982).  There  are  157  observations  con¬ 
sisting  of  survival  time  after  transplant  and  two  covariates:  age  (in  years)  at  time  of  transplant 
and  T5  mismatch  score.  The  procedure  chose  a  span  size  of  .7  and  produced  the  smooth  shown 
in  Figure  (11)  .  The  actual  estimate  of  relative  risk  (ezp(a(-)))  is  shown  in  Figure  (12)  .  A 
summary  of  the  results  is  shown  in  Table  2. 

Table  2.  Stanford  Heart  Transplant  Data 

Analysis  of  Age 


Model 

—2 Log  Likelihood 

Number  of  Parameters 

Null 

902.40 

0 

Age  (linear) 

894.82 

1 

Age  +  Age2 

886.24 

2 

Age  (smooth,  span  .7) 

884.65 

2.95 

Piecewise  linear 

885.40 

2 

The  smooth  reduced  —2  log  PL  from  a  null  value  of  902.40  to  884.65.  For  comparison, 
a  standard  proportional  hazards  model  with  a  single  term  for  age  produced  a  value  of  894.82 
for  -2  log  PL  and  the  addition  of  a  quadratic  term  for  age  reduced  it  to  886.24.  The  resulting 
quadratic  function  is  shown  in  Figure  (11)  (broken  line).  The  smooth  in  Figure  (11)  suggests 
that  the  relative  risk  before  age  45  is  approximately  constant,  while  the  quadratic  curve, 
perhaps  misleadingly,  indicates  a  decrease  in  risk  before  age  45.  We  note  that  the  smooth 
produces  a  smaller  value  of  —2  log  PL  (by  1.6)  but  uses  .95  more  “parameters*. 

Based  on  Figure  (11)  ,  we  tried  to  summarize  «(•)  by  a  piecewise  linear  covariate  z  =  —.2 
for  age  <  44  and  z  =  .12  x  age  —  5.5  for  age  >  44.  Using  z  as  a  covariate  in  a  model  of  the 
form  Ao(t)  exp(/3$(x)),  a  standard  computer  program  for  fitting  proportional  hazards  models 
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produced  a  value  of  885.40  for  —2  log  PL.  This  provides  further  evidence  that  the  quadratic 
shape  for  the  relative  risk  may  not  be  realistic. 


4.6.  Stanford  Heart  Transplant  Data:  Age  and  T5 

The  forward  stepwise  algorithm  was  run  on  the  Stanford  Heart  Transplant  data  described 
in  Example  2.  The  smooths  for  each  variable  separately  are  shown  in  Figures  (11)  and  (13) 
The  threshold  value  was  set  to  zero  to  allow  both  variables  to  enter.  The  results  are  summarized 
in  Table  3. 


Table  S.  Stanford  Heart  Transplant  Data 

Analysis  of  Age  and  T5 


Model 

—2 Log  Likelihood 

Number  of  Parameters 

Null 

902.40 

0 

T5  (smooth,  span=  .7) 

899.99 

2.68 

Age  +  T5 

882.53 

2.95  +  2.68 

Age  +  T5  (backfit) 

882.52 

2.95  +  2.68 

Age  was  entered  first,  then  T5  mismatch  score.  The  smooth  for  T5  is  shown  in  Figure 
(14)  .  Backfitting  had  only  a  negligible  effect,  so  the  smooth  for  age  was  virtually  identical  to 
Figure  (11)  .  The  results  indicate  that  the  effect  of  T5,  after  adjusting  for  age,  is  very  slight. 


4.7.  Example  3:  Mouse  Leukemia  Data 

Kalbfleisch  and  Prentice  (1980)  analyzed  the  results  of  a  study  designed  to  examine  the 
genetic  and  viral  factors  that  may  influence  the  developement  of  spontaneous  leukemia  in 
AKR  mice.  The  original  data  set  contains  204  observations,  with  six  covariates  and  2  causes 
of  death(cancerous  and  non-cancerous)  measured.  Kalbfleisch  and  Prentice  perform  a  number 
of  analyses;  we  will  follow  one  of  them  here,  using  any  death  as  the  endpoint  and  the  four 
covariates: 

xi:  antibody  level  (%) 

X2~.  Gpd-1  phenotype 
*3:  sex(l=male,  2=female) 
x\:  coat  colour 
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Antibody  level  took  on  continuous  values,  although  about  half  of  the  mice  had  a  value  of 
0.  The  other  three  covariates  were  binary.  Of  the  204  observations,  4  had  missing  values  and 
were  discarded. 

Table  4  shows  the  results  of  forward  stepwise  local  likelihood  estimation  applied  to  these 

data. 


Table  4.  Mouse  Leukemia  Data 

Multivariate  Analysis 


Model 

—2 Log  Likelihood 

Number  of  Parameters 

Null 

1189.06 

0 

Antibody  (smooth,  span=  .5) 

1173.98 

1.85 

Antibody+Gpd-1 

1170.90 

1.85  +  1 

Antibody  (linear) 

1183.16 

1 

Antibody  (linear  +  quadratic) 

1183.07 

2 

Piecewise  linear 

1177.34 

2 

Each  of  GPD-1,  sex  and  coat  color  were  modelled  with  a  single  parameter.  Antibody 
was  the  most  important  factor,  reducing  -2  log  PL  by  15.08.  Gpd-1  was  next  in  importance 
but  not  significant  at  95%.  A  graph  of  the  estimated  smooth  for  antibody  in  shown  in  Figure 
(15)  (the  smooth  values  were  not  joined  so  that  the  distribution  of  antibody  levels  could  be 
seen).  It  is  markedly  non-linear,  changing  slope  at  antibody  level  =  7.5%.  Also  included  in 
Table  4  are  linear  and  quadratic  terms  for  antibody.  Even  with  a  quadratic  term,  the  fit  of 
the  parametric  Cox  model  is  significantly  worse  than  the  local  likelihood  smooth. 

Based  on  Figure  (15)  ,  a  piecewise  linear  covariate  was  created  by  joining  each  of  the 
left  and  rightmost  smooth  values  to  the  bending  point  by  straight  lines.  -2  log  PL  for  this 
covariate  was  1177.34,  still  significantly  worse  than  the  smooth  model.  This  indicates  that  the 
bowed  shape  of  the  smooth  between  antibody  levels  7.5%  and  80%  is  supported  by  the  data. 


4.8.  Bootstrapping  the  models 

To  assess  the  variability  of  an  estimated  relative  risk  curve,  the  bootstrap  (Efron  (1979)) 
can  be  applied.  As  in  the  regression  modelling,  there  are  (at  least)  two  ways  to  boot¬ 
strap:  we  can  resample  the  triples  (y,-,  x,,  £,•)  or  the  resample  the  residuals  (r,-,  S,j  (where 
r,'  =  Ao(y,)exp(«(a:,)))  and  add  them  back  to  the  fitted  model.  As  in  the  regression  case,  the 
second  method  assumes  that  the  fitted  model  is  correct. 

The  results  for  these  two  bootstrap  methods  applied  to  Example  2  are  shown  in  Figures 
(16)  and  (17)  .  20  bootstraps  were  computed  for  each  method.  The  two  plots  indicate  about 
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the  same  amount  of  variability  for  the  low  and  medium  age  groups,  but  somewhat  surprisingly, 
the  residual  sampling  shows  greater  variability  in  the  higher  ages.  The  use  of  the  bootstrap 
for  the  proportional  hazard  model  requires  further  study;  Efron(1980)  looks  at  the  bootstrap 
for  the  Kaplan-Meier  curve. 


4.9.  Case  Control  Data  and  a  Comparison  to  Thomas9  Method 

Thomas  (1983)  provides  a  method  of  finding  the  maximum  likelihood  estimate  of  r(s) 
in  the  proportional  hazards  model  A(t  \x)  =  Ao(t)r(:c)  subject  to  r(z)  monotone  in  x.  The 
algorithm  is  extremely  complex  and  not  fully  understood  by  this  author.  It  produces  a  step 
function  r(-)  with  steps  occurring  only  at  some  of  the  failures. 

Thomas  applied  his  algorithm  to  a  data  set  consisting  of  215  lung  cancer  cases,  each 
matched  with  5  controls,  sampled  from  a  large  cohort  of  Quebec  chrysotile  miners  and  millers. 
The  covariate  of  interest  was  total  dust  exposure.  The  effect  of  various  levels  of  dust  exposure 
was  desired  so  that  industry  standards  could  be  established. 

In  order  to  handle  case  control  data  of  this  type,  only  a  small  change  is  required  in  the 
local  likelihood  procedure.  The  local  partial  likelihood  simply  becomes  a  partial  likelihood 
for  case-control  data.  This,  in  turn,  is  the  same  as  the  partial  likelihood  for  prospective  data, 
except  that  each  risk  set  consists  of  a  case  and  its  associated  controls  (see  Prentice  and  Breslow 
(1978)  for  details).  It  turns  out  that  in  the  modified  local  likelihood  procedure,  a  case-control 
set  only  enters  into  the  partial  likelihood  for  a  given  neighborhood  if  the  case  and  at  least  one 
control  exist  in  the  neighborhood. 

Figure  (18)  shows  the  results  of  the  various  estimation  procedures  applied  to  the  lung 
cancer  data.*  The  solid  line  is  the  local  likelihood  smooth  ezp(s()),  and  the  step  function 
(dashed  line)  is  Thomas  monotone  m.l.e.  The  functions  are  in  qualitative  agreement,  with  the 
monotone  m.l.e  suffering  from  its  jagged  shape. 

The  advantages  of  the  local  likelihood  procedure  over  Thomas9  method  are  clear.  The 
monotone  m.l.e  is  not  smooth  and  is  forced  to  be  monotone.  As  well,  Thomas’  procedure  can 
handle  only  one  covariate.  The  local  likelihood  procedure  suffers  from  none  of  these  problems. 


*  Unfortunately,  we  could  only  obtain  a  slightly  smaller  data  set  from  Thomas,  consisting  of  188  of  the  215 
case-control  groups.  The  local  likelihood  procedure  was  applied  to  this  reduced  data  set,  while  Thomas1 
procedure  was  applied  to  the  full  data  set 
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4.10.  A  Bias  Study 

In  this  section  we  discuss  a  few  simulations  designed  to  investigate  how  well  the  proce¬ 
dure  estimates  the  true  underlying  function.  In  particular,  we  want  to  find  out  how  much  it 
underestimates  curvature  for  larger  spans,  especially  at  the  endpoints. 

A  sample  of  200  X  values  were  generated  from  Z/( — 1, 1),  and  survival  times  T  were 
generated  from  the  model  logT  =  5  +  4x2  +  e  where  e  had  the  extreme  value  distribution 
exp(e  -  exp(e)).  This  corresponds  to  the  hazard  model  A(t  ] m)  =  exp(-5  —  4a2).  Censoring 
times  C  were  then  generated  from  1/(0, 11),  and  the  observed  response  was  Y  =  min(T,C). 
This  resulted  in  an  average  censoring  rate  of  51  percent.  Our  aim  here  was  to  found  out  how 
well  the  procedure  reproduces  curvature  in  the  middle  of  the  covariate  range  (so  that  endpoint 
effects  don’t  enter  in).  Figure  (19)  shows  the  average  of  20  replications  (with  the  same  set  of 
x  values)  allowing  the  procedure  to  choose  the  span  by  the  AIC  criterion.  Since  the  functions 
are  determined  only  up  to  an  additive  constant,  they  were  translated  to  have  the  same  mean 
over  the  range  of  x.  The  average  smooth  captures  the  shape  of  the  true  function  remarkably 
well. 

Next,  we  investigated  the  effect  of  endpoint  bias.  We  generated  data  from  the  same  model 
as  above,  except  that  X  was  Z/(-l,  .5)  (We  cut  off  the  X  range  so  that  the  true  function  would 
be  non-linear  near  an  endpoint.)  Figure  (20)  shows  the  average  of  20  replications,  allowing 
the  procedure  to  choose  the  span.  The  average  smooth  underestimates  the  curvature,  but 
reproduces  the  function  quite  well. 

We  conclude  from  this  modest  study  that  the  procedure  has  low  bias,  with  a  tendency  to 
underestimate  curvature  slightly  at  the  endpoints. 


4.11.  A  Robust  Fit 

There  are  two  types  of  influential  points  that  can  create  problems  in  regression  modelling: 
outliers  in  time  space  and  outliers  in  covariate  space.  The  first  type  are  not  as  much  of  a 
problem  here  because  the  partial  likelihood  depends  only  on  the  ranks  of  the  survival  times. 
Still,  Cain  and  Lange  (1983)  give  an  example  in  which  a  few  large  survival  times  have  a  large 
effect  on  the  regression  coefficient. 

Outliers  in  covariate  space  are  potentially  more  dangerous.  Because  of  the  local  nature  of 
the  fitting,  it  will  not  be  as  much  a  problem  in  the  local  likelihood  model  as  it  is  in  the  linear 
proportional  hazards  model,  but  with  spans  as  large  as  .7n,  it  is  still  a  concern. 

A  simple  modification  of  the  fitting  procedure  can  help  reduce  the  effect  of  covariate 
outliers  in  both  the  standard  and  local  likelihood  proportional  hazard  models.  The  idea  is 
to  downweight  observations  based  on  their  distance  from  the  “center”  of  the  data.  This  idea 
is  exploited  in  the  bounded  influence  regression  literature  (see  Krasker  and  Welch  (1973) 
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and  the  references  therein).  In  order  to  define  a  “weighted”  partial  likelihood  estimate,  we 
need  to  define  the  partial  likelihood  for  a  sample  with  weights  «;,•  on  (y,-,  X{,  6,-),  t  =  1, 2, . . .  n. 

=  n).  Consider  first  the  case  where  there  are  no  ties  in  the  original  data.  It  is  natural  to 
require  that  when  the  w,-’s  are  integers,  the  weighted  partial  likelihood  should  exactly  coincide 
with  the  partial  likelihood  for  a  sample  with  to,-  copies  of  point  i.  A  form  suggested  by  Cain 
and  Lange  almost  satisfies  this  requirement: 


PLW  = 


n( 


exp(»iP ) 


£/e*.  v)iexp(zjp) 


) 


to, 


(48) 


When  the  weights  are  integers,  this  reduces  not  to  the  exact  partial  likelihood  for  the  corre¬ 
sponding  sample,  but  to  the  standard  approximation  for  tied  data  given  in  Section  3.2.  As 
long  as  each  to,-  is  small  compared  to  Syej?,  wi'  (M  ^  wiU  be  in  our  case)  this  approximation 
is  adequate.  When  the  original  data  contains  ties,  we  can  modify  (48)  : 


pLw  =  JJ  *jVjP) 

[Ey€fl,  to.-ezp^y^E,'  w‘ 


(49) 


where  cf,-  is  the  number  of  failures  at  y,-.  Expression  (49)  reduces  to  the  correct  (approximate) 
partial  likelihood  when  the  weights  are  integers. 

Maximation  of  (49)  with  appropriate  weights  provides  a  more  robust  fitting  procedure. 
Let  xe  be  some  “center”  of  the  covariate  space  and  let  vy  be  some  scaled  measure  of  distance 
of  *y  from  xe.  Then  a  reasonable  choice  of  weights  is  toy  ~  e-0-’  .  For  the  linear  proportional 
hazards  model,  it  would  be  natural  to  choose  ze  =  2  and 

Vj  =  x'jiX'X)-1!;  (50) 

In  the  univariate  case,  this  reduces  to 


v 


i 


(gy  z)2  1^ 

Ei(*y  ~  *)2  n 


(51) 


For  the  local  likelihood  extension  of  the  model,  we  can  use  partial  likelihood  form  (49) 
in  each  neighborhood,  and  weights  proportional  to  e-tr>  where 


(xj  xi)2  , 

E  -  *()*  K 


(52) 


Note  that  x;  is  used  as  the  center  of  the  neighborhood  instead  of  the  mean —  this  ensures  that 
points  near  the  ends  receive  large  weights  in  their  own  neighborhoods. 

Figure  (21)  shows  the  robust  version  of  the  local  likelihood  procedure  applied  to  age 
variable  (solid  line).  The  smooth  looks  very  similar  to  the  unweighted  smooth  (broken  line); 
this  is  not  surprising  since  there  are  no  outlying  ages  in  the  sample.  Figure  (22)  shows  the 
unweighted  smooth  (broken  line)  applied  to  the  sample  after  having  moved  a  failure  at  the 
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highest  age  (62)  to  92  (only  the  portion  of  the  the  smooth  from  ages  12  to  62  is  shown).  The 
weighted  smooth  (dotted  line)  looks  much  like  the  weighted  smooth  applied  to  the  original  data 
(solid  line,  same  as  solid  line  in  Figure  (21) ).  The  downweighting  has  successfully  reduced  the 
effect  of  the  outlying  point  on  the  overall  smooth.  Of  course,  the  weighting  scheme  described 
here  could  be  applied  within  the  parametric  setting,  but  we  haven’t  pursued  this. 

The  “robust ifying”  scheme  discussed  here  is  important  if  the  local  partial  likelihood  pro¬ 
cedure  is  to  be  used  in  “auto-pilot”  mode;  alternatively,  since  each  covariate  is  fit  separately, 
a  simple  scatter  diagram  of  each  covariate  should  reveal  any  large  outliers  in  covariate  space. 

In  the  theoretical  investigations  of  the  following  sections,  we’ll  restrict  attention  (for 
simplicity)  to  unweighted  smoothing  procedures. 


4.12.  Extending  the  Model 

There  are  (at  least)  two  ways  that  the  model  could  be  extended.  The  first  way  would 
be  to  allow  time-dependent  covariates.  In  principle,  this  would  be  straightforward;  as  in  the 
standard  proportional  hazards  model,  one  would  simply  insert  the  “current”  covariate  values 
when  constructing  each  term  of  the  partial  likelihood.  There  may  be  computational  problems 
with  this,  however.  With  fixed  covariates,  the  risk  sets  can  be  computed  by  “stripping  off” 
each  failure  or  censoring  as  they  occur.  With  time-dependent  covariates,  however,  the  risk  sets 
must  be  recomputed  for  each  failure  time.  This  would  increase  the  cost  by  about  a  factor  of  n. 
We  haven’t  tried  implementing  time-dependent  covariates;  this  may  be  pursued  in  subsequent 
research. 

Another  way  to  generalize  the  model  is  to  allow  linear  combinations  of  covariates  to  enter 
into  the  model.  The  form  of  the  model  would  be 

A(t  |*)  =  A0(t)exp(]T(«(a, •  •*)))  (53) 

The  vectors  a,-  could  be  found  by  a  numerical  search.  This  is  the  “Projection  Pursuit  Re¬ 
gression”  idea  introduced  by  Friedman  and  Stuetzle(1981).  Besides  the  obvious  computational 
cost,  this  model  would  suffer  from  a  lack  of  interpretability. 
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Figure  (12) 

Local  Likelihood  Estimate  of  Relative  risk  for  Age 
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Figure  (14) 

T5  smooth  with  age  in  the  model 
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Figure  (16) 

Bootstrap  smooths  (Resampling  (y,-,  z,-,  £,-)) 
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Figure  (18) 

Estimates  of  Relative  Risk  for  Lung  Cancer  Data 
Circles:  L.L  smooth,  Broken  line:  Monotone  m.l.e 


s(x 


44  Section  5:  Asymptotic  Resuits 


Figure  (20) 

Average  of  20  Local  likelihood  fits,  varying  span 
Solid  line:  L.L  fit.  Broken  line:  true  quadratic  function 
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Figure  (21) 

Solid  line:  Weighted  smooth:  no  outlier 
Broken  Line:  Unweighted  smooth,  no  outlier 
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Figure  (22) 

Solid  line:  Weighted  smooth,  no  outlier 
Broken  and  dotted  lines:  Weighted  and  unweighted  smooths  with  outlier 


age 
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5.  Asymptotic  Results. 


5.1.  Introduction 

Since  local  likelihood  estimates  in  the  exponential  family  are  maximum  likelihood  esti¬ 
mates  calculated  locally,  it  is  not  surprising  that  they  enjoy  (in  some  sense)  the  optimality 
properties  of  m.l.e’s.  Tibshirani(1984)  modifies  McCullagh’s  (1983)  results  for  generalized  lin¬ 
ear  models  to  establish  such  results  for  l.l.e’s  in  the  exponential  family.  We  summarize  these 
results  below.  Similar  results  should  be  obtainable  for  the  general  (non-exponential)  i.i.d  case 
and  for  the  Cox  model;  we  postulate  results  for  the  latter  at  the  end  of  this  section. 


5.2.  Consistency  and  Efficiency  of  LLE’s  in  the  Exponential  Family 

Consider  initially  a  sample  {(aq,  i/i), . . .  {xn, y„)}  containing  an  observation  at  a  point 
X  =  x0.  We  shall  investigate  asymptotic  properties  of  the  LLE  at  xq.  We  assume  that 
tfi»  V2>  •  •  ■  Vn  are  independent  realizations  of  random  variables  Yi,...Yn,  having  density 

Yi  ~  k(yi)  exp{y,-0,-  -  6(0,)  -  c(j/,-,  <r)}  (54) 

where  0,-  =  s(xt).  Let  k„  be  the  number  of  points  in  the  neighborhood  Ng  used  for  esti¬ 
mating  a(z0).  Assume  that  as  n  -»  oo,  k„  -*  oo,  but  the  neighborhood  shrinks  so  that 
max{i,ieN^}  \xi  ~  xj\  =  o(k n1^2)-  We  argue  below  that  for  estimation  of  the  slope  and  inter¬ 
cept  of  the  line  tangent  to  «(•)  at  xo,  the  LLE  is  consistent  and  asymptotically  normal,  and 
has  the  efficiency  of  a  MLE  based  on  sample  size  kn.  This  implies  that,  for  estimation  of 
s(x o),  the  LLE  has  minimum  asymptotic  mean  squared  error  among  all  estimates  based  on  kn 
observations. 

Letting  X  =  (1,*),  the  score  function  for  the  local  likelihood  at  xo  is 

Up  =  XtW(Y  -  aid))  (55) 

where  «(/?)  =  b'(Xp),  and  W  =  Diag{I(i  €  N£)}nXn. 

Let  u(/?)  =  E(Y)  =  b'(ff),  ip  =  Var(Up)  and  P  =  (0i,02)  be  the  coefficients  of  the  line 
tangent  to  «(•)  at  zq,  i.e.  fa  =  s'(zo)  and  Pi  =  s(zo)  —  ^2*0-  Then  under  regularity  conditions 
on  ip  and  the  third  moment  of  Y,  and  smoothness  constraints  on  «(•)  and  b(-),  we  have 

k~l/2Up  ~  o,  oHp!kn)  +  Op{k?f2)  (56) 

E(P-P)  =  0(k~1) 


(57) 
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and 

-  /»)  ~  Jfc(0,  KSif)  +  O, (*-*/*) 

These  imply  the  following  results  for  the  local  likelihood  estimate  «(®o)  =  Pi  +  02xo'- 

E(i(x0)-B(x0))  =  O(k-1) 

and 

*n/2(«(*o)  -  *(*o))  ~  M(0,kn(T2A)  +  Op(*“1/2) 
where  A  =  (1  a;o)*^1(l  so)*- 


(58) 


(59) 

(60) 


5.3.  Some  Remarks 

•  As  mentioned  previously,  we  proved  the  above  results  by  extending  McCullagh’s  (1983) 
results  for  generalized  linear  models,  McCullagh  starts  with  the  score  equation  D*V~(Y  — 
u(/?))  =  0  where  D  =  du/dfi  and  V  =  Cov(Y)a 2.  (This  reduces  to  the  form  X*(Y  -  u(/J)) 
when  the  link  function  is  such  that  $  =  Xfi).  From  this  he  proves  consistency  and 
asymptotic  normality  of  the  estimate  ji.  Also,  he  notes  that  to  obtain  the  asymptotic 
results,  it  is  not  necessary  to  assume  a  form  for  the  likelihood:  one  need  only  assume  that 
the  score  equation  has  the  form  D*V~(Y  -  u(/9)).  Since  this  equation  only  depends  on 
the  first  two  moments  of  Y,  there  can  be  more  than  one  likelihood  giving  the  same  score 
equation.  McCullagh  calls  any  likelihood  giving  this  score  function  a  “quasi-likelihood”. 
If  Y  is  in  the  exponential  family  and  the  log-likelihood  is  linear  in  ft,  then  the  likelihood 
and  quasi-likelihood  correspond.  In  other  cases,  there  can  be  more  them  one  likelihood 
resulting  in  the  same  quasi-likelihood.  In  this  event,  the  quasi-likelihood  estimate  may 
not  equal  the  MLE,  but  it  is  still  consistently  and  efficiently  estimates  the  true  parameter. 
According  to  McCullagh,  “quasi-likelihood”  estimation  could  be  useful  in  a  situation  in 
which  one  isn’t  willing  to  assume  a  specific  form  for  the  likelihood,  but  is  willing  to  specify 
a  relationship  between  the  mean  and  variance.  The  same  phenomenen  is  true  in  the  local 
likelihood  model —  we  need  only  assume  that  the  local  score  equation  has  form  (55)  , 
and  the  results  still  hold. 

•  The  results  above  assumed  that  the  maximum  distance  between  any  two  points  in  a 
neighborhood  goes  down  at  the  rate  o(k„  ^2).  In  the  local  likelihood  procedure,  the  span 
is  chosen  to  minimize  an  Akaike-type  criterion.  In  principle,  then,  one  should  show  that 
selecting  the  span  in  this  way  results  in  the  correct  order  of  shrinkage  of  the  neighborhood. 
We  haven’t  pursued  this,  however, 

•  We  have  established  convergence  results  for  the  estimate  of  a  single  value  of  the  smooth 
function.  With  considerably  more  work,  one  could  presumably  show  convergence  of  the 
entire  estimated  function  to  a  Gaussian  process. 
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5.4.  Asymptotics  for  the  Proportional  Hazards  Model 

In  this  section,  we  conjecture  an  asymptotic  result  for  the  local  likelihood  procedure  in 
the  proportional  hazards  model. 

Suppose  n  items  are  placed  on  test  and  give  rise  to  (possibly  censored)  observation  times 
{l/i >  1/2,  —l/fi}  with  associated  (fixed)  covariates  {zj,  *2-  -,  *n}-  Let  £,•  =  0  if  y,-  is  censored  and  1 
if  yi  is  uncensored,  and  following  Tsiatis(1980),  we  assume  that  the  triples  ({/,-,  x,-,  £,)  are  i.i.d. 
Let  D  be  the  set  of  indices  of  the  failures  among  the  y,-’s.  To  facilitate  construction  of  a  partial 
likelihood,  we  will  make  the  usual  assumption  that  the  censoring  mechanism  is  non-informative 
(see  Kalbfleisch  and  Prentice(1980)). 


Under  the  model 


A(t  |z)  =  A0(t)exp(z£) 

(61) 

the  partial  likelihood  is 

(62) 

and  the  score  function  is 

t<ED  \  e  ) 

(63) 

Tsiatis  shows  that  there  exist 

a  consistent  root  /?  of  the  score  equation  such  that 

n^20  —  0)  —*  M  (0,  v) 

(64) 

where  v  =  /0r°  —dQVar(z  |i?(t)),  Q(t)  =  P(t  >  t,S  =  1),  and  To  is  an  upper  bound  on  Y\ 
In  the  local  likelihood  framework,  we  assume  that  the  hazard  has  the  form 


A(t  |z)  =  A0(t)  exp  («(z)) 


(65) 


where  s(x)  is  some  smooth  function  of  z.  The  derivative  a'(zo)  at  some  fixed  point  zo  is 
estimated  by  /?o  maximizing  the  local  partial  likelihood 


pl>-  n 

leDnNs 


ePo*i 

eii,rw0"  e/?0*; 


(66) 


The  local  score  equation  is 


«o(^o)  = 


leDnNg  \ 


^ieRinN"  xje^oXl  \ 
SyeRifWj  e/?0*'  / 


(67) 


As  in  the  exponential  family  case,  we  assume  that  as  n  — »  oo,  k„  — ►  oo  and 
max{ijeN£}  \xi  ~  Xj\  =  o(kn1^2).  A  reasonable  conjecture,  under  regularity  conditions  on 
«(•),  is  that  the  local  score  equation  has  a  consistent  root  fio,  and  asymptotically 


k}/2(Po  ~  «*(*<>))  A/(0,  t>) 


(68) 
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where  v  is  defined  above.  Fixing  s(x')  =  s(x')  =  0  for  some  x',  a  result  like  (68)  could  also  be 
obtained  for  s(zo)>  This  would  require  a  convergence  proof  for  the  integral  estimator  s(zo)  = 
J*,°  s'(t)dt,  and  hence  consideration  of  the  simultaneous  estimation  of  <(•)  at  z1(  z2)  •  •  •  xn  We 
will  not  attempt  to  prove  these  results;  the  simpler  case  treated  by  Tsiatis  is  quite  involved. 
Recently,  more  general  results  for  the  proportional  hazards  model  (not  requiring  that  the 
triples  (y, •,£»,£»)  be  i.i.d)  have  been  obtained  using  a  martingale  approach  by  Anderson  and 
Gill  (1982).  A  modification  of  those  results  to  local  likelihood  estimation  should  also  be 
possible. 

As  for  efficiency  considerations,  Efron(1977)  and  Oakes(1977)  show  that  the  partial  like¬ 
lihood  estimate  has  good  asymptotic  efficiency  relative  to  the  full  likelihood  estimate  if  the 
family  of  hazard  models  is  rich  enough.  Similar  results  may  be  obtainable  for  the  local  partial 
likelihood  estimate. 
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6.  Degrees  of  Freedom  and  AIC  Approximations. 

In  the  data  analyses  of  the  preceding  sections,  we  have  used  the  formula 
degrees  of  freedom  ns  trace(P).  Here  we  state  more  precisely  the  result  that  is  shown  in 
Tibshiram(1984)  and  discuss  a  small  simulation  study  to  check  its  accuracy.  Finally,  we  jus¬ 
tify  the  use  of  the  AIC  is  choosing  the  span.  As  before,  we  concentrate  on  the  exponential 
family  case,  although  our  simulations  suggest  that  similar  results  might  be  obtainable  in  the 
proportional  hazards  model  as  well. 

We  assume  that  the  y,’s  are  independent  with  density  of  the  exponential  family  form 

90i{y%)  =  exp(yi$i  ~  Wi)  ~  e(y,-, <r))  (69) 

with  respect  to  some  carrier  measure.  The  scale  parameter  a  plays  no  special  role  and  is 
assumed  to  be  1  for  simplicity.  Let  kg(p)  =  90i(Vi),  and  let  b(9)  =  (fe(^i),  6(^2),  -  •  -  &(^«)). 

It  will  be  convenient  to  instead  index  Jfe(y)  by  the  expectation  parameter  ft  =  E$y  =  6'(0). 
The  deviance  between  y  and  y  is  defined  by 

D(.V,V)  =  2[logk>(y)  -  log  *$(?)]  (70) 

A  single  covariate  z  =  (xi,x2, . . .  x„)  is  available. 

Central  in  our  discussion  will  be  the  “smoother  matrix”  P  “corresponding”  to  a  local 
likelihood  fit  p.  Given  a  linear  covariate  vector  z,  the  running  lines  smooth  of  a  data  vector 
p  ,  based  on  z,  can  be  written  as  p  =  Pp.  We  call  P  a  “smoother  matrix”;  it  will  depend  on 
z  and  on  the  span  of  the  smoother.  Given  a  running  lines  smoothing  algorithm,  it  is  easy  to 
produce  P —  the  ith  row  of  P  is  the  output  of  the  smoother  applied  to  the  ith  unit  vector. 

For  Gaussian  data,  the  local  likelihood  fit  is  simply  p  =  Pp.  For  non-Gaussian  likelihoods, 
the  matrix  P  doesn’t  enter  in  the  local  likelihood  estimation  procedure  explicitly,  but  its  trace 
turns  out  to  be  important  nonetheless. 

Now  let  pi  and  p2  be  two  local  likelihood  fits,  based  on  z,  with  corresponding  smoother 
matrices  Pi  and  P 2.  Suppose  that  the  two  smooths  produce  the  same  fit  on  the  average,  i.e. 

(^jr, )  =  E(6y3),  where  in  obvious  notation,  0p  =  V  *(y).  Then  it  is  shown  in  Tibshirani 
(1984)  that 

E(D{p,  ft)  -  D{p,  p2))  ns  traee{P2)  -  trace(Px )  (71) 

This  result  is  exact  for  the  Gaussian  case,  where  the  deviance  is  the  residual  sum  of  squares, 
and  approximate  for  other  exponential  family  likelihoods.  We  describe  in  the  next  section  a 
simulation  study  to  assess  the  accuracy  of  this  approximation, 

Note  how  (71)  generalizes  the  standard  result  for  global  linear  fitting.  Consider  a  normal 
linear  regression  model,  with  variance  equal  to  1.  Let  pi  and  p2  represent  the  fitted  vectors 
corresponding  to  nested  subspaces  of  rank  pi  and  p2,  with  Pi  <  p2.  Then  if  the  true  mean 
response  lies  in  the  smaller  space,  then  the  decrease  in  residual  sum  of  squares  due  to  fitting  the 
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larger  model  is  Xpt-Pl-  Result  (71)  expresses  this  in  expectation,  for  Kullback-Leibler  distance 
is  Euclidean  distance  in  the  Gaussian  model,  and  Pi  and  P2  are  the  matrices  that  project  onto 
the  corresponding  subspaces  (recall  that  that  for  a  projection  matrix,  traee(P)  =  ronJfc(P)). 
For  linear  fitting  in  other  exponential  family  models  (GLM’s),  result  (71)  expresses  Wald’s 
theorem  in  expectation. 


6.1.  Degrees  of  Freedom  Simulations 

Table  5  shows  the  results  of  a  modest  simulation  study  designed  to  check  the  accuracy  of 
the  formula  E(D(p,p i)  -  D(p,p2))  =  traee(P2)  -  trace(Pi). 


Table  5.  Results  of  Degrees  of  Freedom  Simulation 


Entries  in  Lines  (2) — (5)  are  mean(variance)  of  deviance  decrease 


Span 

Source _ £ _ A _ *5 _ JJ _ j_ 


(1)  Trace(P )  —  1 

4.09 

3.32 

2.65 

2.34 

2.16 

(2)  Scatterplot  Smoothly  normal) 

4.14(10.00) 

3.39(7.75) 

2.61(6.03) 

2.31(5.08) 

2.09(4.32) 

(3)  Scatterplot  Smoothly  uniform) 

4.19(10.06) 

3.46(8.50) 

2.77(6.52) 

2.41(5.79) 

2.21(4.99) 

(4)  Logistic  Model  (const  ant  vs  smooth) 

4.34(13.47) 

3.40(11.62) 

2.72(9.12) 

2.2°.  (7.51) 

2.17(6.28) 

(5)  Logistic  Model  (linear  vs  smooth) 

3.29(11.71) 

2.25(8.25) 

1.63(6.21) 

1.29(4.58) 

1.12(2.89) 

(6)  Cox  Model  (no  censoring) 

5.58(13.37) 

4.24(8.99) 

3.63(7.52) 

3.12(6.25) 

2.71(5.48) 

(7)  Cox  Model (40%  censoring) 

5.36(13.54) 

4.16(9.04) 

3.62(6.98) 

3.13(5.86) 

2.73(5.20) 

The  numbers  in  the  table  were  obtained  as  follows.  100  x  values  were  generated  from 
(0, 1)  and  fixed  for  the  entire  table.  Given  these  x  values,  we  constructed  the  running  lines 
smoother  matrices  for  the  indicated  spans,  and  the  trace  of  each  matrix  (minus  1)  is  shown  in 
line  (1). 

Consider  for  example  the  entry  4.09  in  the  top  left  hand  comer.  According  to  the  preced¬ 
ing  derivation,  this  should  be  the  expected  decrease  in  deviance  due  to  fitting  a  local  likelihood 
model  with  that  span  .3  versus  a  model  with  only  a  constant. 

To  obtain  line  (2),  we  generated  100  yf s  from  M(0, 1)  and  computed  P(p,yl)  -  R{p,p), 
P  being  the  fit  from  a  scatterplot  smoother  (p  =  Pp)  with  span  as  shown.  Line(2)  shows  the 
mean  and  variance  from  500  such  repetitions  of  this  process. 
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Line  (3)  was  obtained  in  same  way  as  line  (2),  except  that  the  y.’s  were  generated  from 
uniform  (— y/3,  \/3),  the  range  chosen  so  that  V  or  (j/,)  =  1. 

To  obtain  line(4),  we  generated  100  y,-’s  from  binomial{  1,1/2)  and  fit  a  smooth  logistic 
model  with  spans  of  .3  to  .7.  The  numbers  show  the  mean  and  variance  of  D[p,  yl)  —  D{p,  y) 
over  500  repetitions. 

Line  (5)  was  generrted  in  a  similar  fashion  as  line  (4),  showing  instead  the  mean 
variance  of  D(p,  yj)  —  D(p,  y),  y/  being  the  linear  logistic  fit,  with  y,-  generated  from  a  linear 
logistic  model,  P(y,-  =  1  \x)  =  e2*/(l  +  e-2*). 

Lines  (6)  and  (7)  show  simulation  results  for  the  Cox  model.  100  y  values  were  generated 
according  to  y  =  exp(l  +  e),  where  e  had  an  extreme  value  distribution.  This  corresponds 
to  a  constant  hazard  (exponential)  model.  For  line  (6),  no  censoring  was  applied.  For  line 
(7),  censoring  variables  c,-  were  generated  from  e®“,  U  rs>/  2/(0, 1).  This  produced  a  censoring 
rate  of  about  40%.  A  smooth  Cox  model  was  fit  and  the  quantity  —2  log  L(null  model)  — 
(~2  log  L(amooth))  was  computed.  Lines  (6)  and  (7)  show  the  mean  and  variance  of  this 
quantity  over  500  repetitions. 

Note  that  for  all  the  models,  a  span  of  100%  gives  either  exactly  or  asymptotically  a  mean 
value  of  1  (by  Wald’s  theorem),  and  traee(P)  —  1  is  also  equal  to  1. 

The  results  give  fairly  strong  support  to  the  approximation  E(D(p,pi)  —  D(p,p 2))  = 
trace(P 2)  —  trace(Pi),  Lines  (2)  and  (3)  agree  well  with  (1),  not  surprising  since  the  approx¬ 
imation  is  exact  for  scatterplot  smoothers.  Line  (4)  also  is  in  good  agreement,  with  a  small 
upward  bias  for  smaller  spans.  Line  (5)  should  be  1  less  than  line  (1),  (since  the  global  linear 
fit  uses  2  degrees  of  freedom)  and  the  results  indicate  that.  For  the  Cox  model,  the  trace 
formula  is  biased  downward,  especially  in  the  lower  spans. 

The  variance  results  are  a  little  unsettling.  The  variance  to  mean  ratio  is  often  greater 
than  2  (the  ratio  for  a  chi-square  variate),  especially  for  the  non-gaussian  models. 

We  conclude  from  these  simulations  that  the  approximation  E(D(p,pi)  -  D(p,  y2))  = 
trace(P2)  —  trace(Pt)  i3  satisfactory  as  a  rough  rule  of  thumb,  for  the  gaussian  and  logistic 
models.  We  do  note,  however,  that  the  distribution  of  this  decrease  is  more  spread  out  than 
a  chi-square  variate  with  the  corresponding  degrees  of  freedom,  so  that  tests  based  on  the 
percentile  points  will  be  too  liberal.  The  downward  bias  of  traee(P)  the  Cox  model  does  not 
cause  a  serious  problem  in  the  A.IG  formula  —  21og£r  +  2trace(P),  since  the  bias  will  tend  to 
cancel  out  in  comparing  the  A1G  for  two  spans.  In  assessing  the  significance  of  a  Cox  smooth 
in  real  examples,  however,  we  find  the  mean  decrease  by  simulation  as  in  Table  5.  Indeed, 
for  any  of  the  models,  it  may  be  preferable  to  get  “exactly”  the  distribution  of  the  decrease 
by  simulation,  for  these  simulations  are  not  expensive.  Finally,  we  would  like  to  acknowledge 
that  the  idea  of  finding  degrees  of  freedom  by  simulation  for  a  scatterplot  smoother  was  first 
suggested  to  the  author  by  Arthur  Owen. 
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6.2.  Aka  Ike’s  Information  Criterion  {AIC)  For  Span  Selection 

Using  the  results  of  the  previous  section,  we  show  in  this  section  that  it’s  reasonable  to 
use  an  AIC  criterion  to  choose  the  span  in  the  local  likelihood  estimation  procedure. 

Let’s  briefly  review  the  AIC  for  a  parametric  model.  Given  a  model  kp,  suppose  we 
can  choose  among  maximum  likelihood  estimates  •••/*{  based  on  Pi,P2,—Pi  degrees  of 

freedom  respectively.  Suppose  also  that  each  model  can  be  considered  a  sub-model  of  a  true 
model  kp0.  Then  Akaike’s  information  criterion  (AIC)  (Akaike  1973)  specifies  that  we  should 
choose  the  estimated  model  that  minimizes 

AIC  =  -2  log  *£.($)  +  2 pi  (72) 


Akaike  derived  the  AIC  by  showing  that  E(AIC)  ta  E(D(po,  /*,-))  +  constant.  Hence  the 
model  that  minimizes  AIC  approximately  minimizes  the  expected  Kullback-Leibler  distance 
from  the  true  model. 

FVom  the  form  of  the  AIC,  it  is  clear  that  it  attempts  to  trade-off  goodness  of  fit  of 
the  estimated  model  with  its  complexity.  Not  surprisingly,  it  turns  out  to  be  identical  to 
Mallow’s  Cp  in  the  linear  regression  setting  and  asymptotically  equivalent  to  the  cross- validated 
likelihood  technique  in  general  (see  Stone  (1977)  for  these  results). 

In  the  local  likelihood  procedure,  we  propose  choice  of  the  span  parameter  s  to  minimize 

AIC  =  -2  log  (y)  +  2  traee(P(s))  (73) 

where  P(s)  denotes  the  smoother  matrix  producing  running  lines  fits  with  span  s,  and  jr(a) 
denotes  the  corresponding  fitted  values.  This  makes  sense  intuitively:  as  the  span  s  increases, 
the  first  term  will  (generally)  increase  but  the  degrees  of  freedom  traee(P(s))  will  decrease. 
Hence  the  AIC  will  trade  off  lack  of  fit  with  complexity  of  the  smooth. 

In  Tibshiram  (1984),  we  show  that  the  AIC  approximately  equals  a  measure  of  expected 
distance  to  the  true  model.  The  logic  of  the  derivation  follows  that  of  Akaike  (1973).  Consider 
the  exponential  family  set-up  of  the  previous  section.  Using  the  notation  of  that  section,  we 
let  P  be  a  running  lines  smoother  matrix  corresponding  to  some  span.  Then  we  show  that 

E(D(ito,y))  «  -n  +  2traee(P)  (74) 

Noting  that  D(y,y))  =  —  2logA^(p))  +  constant  and  also  that  n  is  constant  for  all  spans,  we 
see  that  the  fit  y  minimizing  the  AIC  criterion  (73)  ,  minimizes  an  estimate  of  distance  to 
the  true  model  kp0. 
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Software 

Fortran  programs  that  compute  local  likelihood  estimates  in  the  Cox  model  and  in  gen¬ 
eralized  linear  models,  are  available  from  the  author. 
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